MolLM: a unified language model for integrating biomedical text with 2D and 3D molecular representations

Abstract

Motivation: The current paradigm of deep learning models for the joint representation of molecules and text primarily relies on 1D or 2D molecular formats, neglecting significant 3D structural information that offers valuable physical insight. This narrow focus inhibits the models' versatility and adaptability across a wide range of modalities. Conversely, the limited research focusing on explicit 3D representation tends to overlook textual data within the biomedical domain.

Results: We present a unified pre-trained language model, MolLM, that concurrently captures 2D and 3D molecular information alongside biomedical text. MolLM consists of a text Transformer encoder and a molecular Transformer encoder, designed to encode both 2D and 3D molecular structures. To support MolLM's self-supervised pre-training, we constructed 160K molecule-text pairings. Employing contrastive learning as a supervisory signal, MolLM demonstrates robust molecular representation capabilities across four downstream tasks: cross-modal molecule and text matching, property prediction, captioning, and text-prompted molecular editing. Through ablation, we demonstrate that the inclusion of explicit 3D representations improves performance in these downstream tasks.

Availability and implementation: Our code, data, pre-trained model weights, and examples of using our model are all available at https://github.com/gersteinlab/MolLM. In particular, we provide Jupyter Notebooks offering step-by-step guidance on how to use MolLM to extract embeddings for both molecules and text.


Property Prediction Head
Below are the details of the prediction head that we append to the base molecular encoder to adapt our model to the property prediction task.
The main motivations behind the prediction head are to reduce the feature dimensionality so that the linear layers learn the features most important for property prediction, to keep the output distribution uniform via the normalization layers for training stability, and to apply dropout to avoid both underfitting and overfitting (Liu et al., 2023c). We perform a grid search to obtain the final prediction head dimensions and hyperparameters.
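As an illustration, a minimal PyTorch sketch of such a head follows; the hidden width, dropout rate, and activation are placeholder assumptions, not the grid-searched values.

```python
import torch.nn as nn

class PredictionHead(nn.Module):
    """Illustrative property-prediction head: a linear layer reduces the
    encoder features, LayerNorm stabilizes training, and dropout guards
    against overfitting. All dimensions here are assumptions."""
    def __init__(self, encoder_dim=768, hidden_dim=256, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(encoder_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 1),  # one logit per binary property
        )

    def forward(self, mol_embedding):
        return self.net(mol_embedding)
```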

Use of RDKit
We use RDKit to represent molecules from SMILES strings and obtain their graph structure before converting them to our specific graph format with PyTorch Geometric. We impose an upper threshold on the root-mean-square deviation (RMSD) after molecular augmentation to ensure that the augmentation does not produce a molecule too semantically different from the original for use in contrastive learning.
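A minimal sketch of this conversion is shown below; the node features are simplified to atomic numbers for illustration, whereas the actual pipeline uses a richer feature set.

```python
import torch
from rdkit import Chem
from torch_geometric.data import Data

def smiles_to_graph(smiles: str) -> Data:
    """Parse a SMILES string with RDKit and convert it into a minimal
    PyTorch Geometric graph: atomic numbers as node features, bonds as
    undirected edges. The feature set is a simplified assumption."""
    mol = Chem.MolFromSmiles(smiles)
    x = torch.tensor([[atom.GetAtomicNum()] for atom in mol.GetAtoms()],
                     dtype=torch.float)
    edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]
    edges += [(j, i) for (i, j) in edges]  # make the graph undirected
    edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()
    return Data(x=x, edge_index=edge_index)
```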

Molecule Augmentation Examples
We utilize four different molecule augmentations during our pre-training process: Node Dropping, Random Subwalk, Chemical Transformation, and Subgroup Removal. We chose these augmentations because they edit molecules in a manner that we believe preserves their semantic meaning while altering them enough for the model to become more robust and explore the latent space. A more detailed explanation of each of these processes can be found in Section 3.3.1. Figure 1 depicts five examples of how these augmentations transform an original molecule into its augmented versions. Consider two of these examples. The first row shows the molecule betaine. We can observe the effect of node dropping, as one of the carbons has been dropped from the nitrogen. The random subwalk began on the carbon atom connected to the oxygen atoms and traversed through the graph. The chemical transformation added an amine group to one of the carbon atoms. Subgroup removal retained only one subgroup of the betaine, based on BRICS decomposition. The fourth row shows acetylphosphate. Node dropping caused an oxygen to be dropped from the phosphorus atom. The random subwalk took a walk of the molecular graph, starting from the oxygen atom that is double-bonded to the phosphorus atom. The chemical transformation introduced an amine group on an oxygen atom. Subgroup removal left a random subgroup of the graph, as dictated by BRICS decomposition.
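For concreteness, a minimal RDKit sketch of the simplest augmentation, node dropping, is given below; the error handling is an assumption and not the paper's exact procedure.

```python
import random
from rdkit import Chem

def drop_random_atom(mol: Chem.Mol) -> Chem.Mol:
    """Illustrative node dropping: remove one randomly chosen atom together
    with its bonds. Some removals break valence rules, so sanitization
    errors are caught and the caller is expected to retry or skip."""
    rw = Chem.RWMol(mol)
    rw.RemoveAtom(random.randrange(rw.GetNumAtoms()))
    Chem.SanitizeMol(rw, catchErrors=True)
    return rw.GetMol()
```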

Cross-modality Matching
We utilize a contrastive loss between the graph representations and text representations for fine-tuning. See Equation 1, where $m$ is the margin controlling how far apart embeddings of negative pairs should be relative to positive pairs, and $\delta_{ij}$ is the Kronecker delta function:

$$\mathcal{L}_{\mathrm{match}} = \sum_{i,j} \max\!\left(0,\; m + \cos(x_{\mathrm{graph},i},\, x_{\mathrm{text},j}) - \delta_{ij}\cos(x_{\mathrm{graph},i},\, x_{\mathrm{text},i})\right) \tag{1}$$

Specifically, we utilize $m = 0.2$ throughout all of our fine-tuning for this task. We fine-tune for 60 epochs with a batch size of 64 at a learning rate of $5 \times 10^{-5}$ for each subtask.
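A batched PyTorch sketch of this loss follows; treating all off-diagonal pairs as negatives and averaging over them is one plausible reading of Equation 1, not necessarily the exact reduction used in the paper.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(x_graph: torch.Tensor,
                                 x_text: torch.Tensor,
                                 m: float = 0.2) -> torch.Tensor:
    """Margin contrastive loss over a batch of paired embeddings.
    Row i of x_graph and x_text (both of shape (B, D)) form a positive
    pair; all other rows act as negatives."""
    g = F.normalize(x_graph, dim=-1)
    t = F.normalize(x_text, dim=-1)
    sim = g @ t.T                       # sim[i, j] = cos(x_graph_i, x_text_j)
    pos = sim.diagonal().unsqueeze(1)   # cos(x_graph_i, x_text_i), shape (B, 1)
    hinge = torch.clamp(m + sim - pos, min=0.0)
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)  # drop i == j terms
    return (hinge * mask).sum() / mask.sum()
```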

Property Prediction
For the fine-tuned tasks, which are all classification tasks, we utilize a binary cross-entropy loss. See Equation 2, where $N$ is the number of samples, $\hat{y}_i$ is the predicted value, and $y_i$ is the target label:

$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right] \tag{2}$$

We fine-tune for 200 epochs with a batch size of 32 and a learning rate of $10^{-4}$. See Appendix 2 for details on the linear prediction head for the classification task.
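In PyTorch terms, one fine-tuning step might look like the following sketch; `encoder`, `head`, and the batch structure are hypothetical names for illustration.

```python
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()  # Equation 2, applied to raw logits

def training_step(encoder, head, batch, optimizer):
    """One classification fine-tuning step. `encoder` is the molecular
    encoder, `head` the prediction head of Appendix 2, and `batch` a
    (graphs, labels) pair -- all placeholder names."""
    graphs, labels = batch
    logits = head(encoder(graphs)).squeeze(-1)
    loss = criterion(logits, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```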

Molecule Caption
We utilize a cross-entropy loss between the predicted captions and target captions. See Equation 3 for this loss, where $p(y_{ij} \mid x_i)$ is the model's probability for token $y_{ij}$ given the input (the concatenated SMILES string embedding and molecule embedding), and $T_i$ is the length of the $i$-th target caption:

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{T_i} \log p(y_{ij} \mid x_i) \tag{3}$$

We fine-tune for 10 epochs with a batch size of 16 and a learning rate of $10^{-4}$.
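A sketch of Equation 3 in PyTorch follows; the padding index and tensor shapes are assumptions.

```python
import torch.nn as nn

# Token-level cross-entropy over the target caption, ignoring padding
# (pad index 0 is an assumption).
caption_criterion = nn.CrossEntropyLoss(ignore_index=0)

def caption_loss(logits, target_ids):
    """logits: (B, T, V) per-token vocabulary scores; target_ids: (B, T).
    CrossEntropyLoss expects the class dimension second, hence the transpose."""
    return caption_criterion(logits.transpose(1, 2), target_ids)
```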

Molecule Editing
Finally, for the molecule editing task, we do not fine-tune our model, as the task involves de novo generation. Instead, we optimize the predicted molecule such that its embedding through MolLM is more aligned with the text prompt's embedding within the MolLM latent space. We utilize a mean squared error (MSE) between the predicted molecule embedding and the original molecule embedding to maintain similarity to the original molecule. See Equation 4 for the MSE loss, where $z_{\mathrm{pred}}$ is the predicted molecule embedding, $z_{\mathrm{orig}}$ is the original molecule embedding, and $N$ is the number of samples:

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left\lVert z_{\mathrm{pred},i} - z_{\mathrm{orig},i} \right\rVert^2 \tag{4}$$
Additionally, we incorporate a dot-product-based loss to ensure alignment between the predicted molecule's embedding and the text prompt's embedding. See Equation 5 for this loss, where $z_{\mathrm{text}}$ is the text prompt's embedding (the dot product is negated so that greater alignment yields a lower loss):

$$\mathcal{L}_{\mathrm{dot}} = -\, z_{\mathrm{pred}} \cdot z_{\mathrm{text}} \tag{5}$$
This dual approach, a linear combination of these two losses for similarity retention and prompt alignment, ensures that edited molecules remain related to the original while being modified in a manner relevant to the prompt.
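A sketch of this combined objective is given below; the weighting `lam` and the negation of the dot product are illustrative assumptions.

```python
import torch

def editing_objective(z_pred: torch.Tensor,
                      z_orig: torch.Tensor,
                      z_text: torch.Tensor,
                      lam: float = 1.0) -> torch.Tensor:
    """Linear combination of the two editing losses (Equations 4 and 5).
    `lam` is a placeholder weighting, not the value used in the paper."""
    mse = torch.mean((z_pred - z_orig) ** 2)       # stay close to the original
    align = -(z_pred * z_text).sum(dim=-1).mean()  # reward prompt alignment
    return mse + lam * align
```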

Embedding Translation MLP
The weights of this MLP, which projects from the MoFlow latent space to the MolLM latent space, were trained to maximize the cosine similarity between the translated embeddings and the actual MolLM embeddings for molecules within the ZINC (Sterling and Irwin, 2015) dataset. The cosine similarity loss for $N$ pairs, where $z_{\mathrm{pred}}$ is the predicted translation and $z_{\mathrm{true}}$ is the true embedding through MolLM's molecule encoder, is

$$\mathcal{L}_{\cos} = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \cos\!\left(z_{\mathrm{pred},i},\, z_{\mathrm{true},i}\right)\right)$$
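The following sketch illustrates this setup; the layer widths and the use of `nn.CosineEmbeddingLoss` (which computes $1 - \cos$ for a positive target) are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative translation MLP (the layer widths are assumptions) mapping
# MoFlow latents into the MolLM embedding space.
translator = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 768),
)

# For a target of +1, nn.CosineEmbeddingLoss computes 1 - cos(z_pred, z_true).
cosine_loss = nn.CosineEmbeddingLoss()

def translation_loss(z_moflow: torch.Tensor, z_true: torch.Tensor) -> torch.Tensor:
    z_pred = translator(z_moflow)
    target = torch.ones(z_pred.size(0), device=z_pred.device)
    return cosine_loss(z_pred, z_true, target)
```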

Model Architecture Explanation
Our choice of model architecture was based on our selection of molecular representations and the advantages offered by Transformers in encoding molecules. For molecular representation, we examined prior works to inform our approach. While SMILES and SELFIES strings are simple and directly leverage natural language pipelines, previous literature has highlighted limitations of these representations. For instance, SMILES strings can represent the same molecule in multiple ways, leading to unnecessary data redundancy. Additionally, both SMILES and SELFIES representations lack the ability to encode 2D/3D spatial information, which leads to decreased performance in certain tasks. Thus, we chose to focus on graph-based molecular representations. Our reasoning for using Transformers to encode molecules stems from the architecture's superior capability to handle sequences. In contrast to traditional GNN or graph isomorphism network (GIN) models, which primarily emphasize the graph structure of molecules, Transformers allow for more flexible encoding. The attention mechanism within Transformers captures molecular patterns, while the encoder-decoder structure allows us to easily leverage multimodal inputs for pre-training. Consequently, we can effectively utilize textual data as input, enabling our model to understand domain-specific knowledge associated with a given molecule.

Property Prediction Baseline Models
We briefly describe the methods of the models that we compare against on the property prediction task. Random Forest (RF) utilizes ensembles of decision trees (Ho, 1995). For this task, Zhu et al. (2021) use molecular fingerprints, including key chemical properties and structural features, as input features for RF; they do not specify their RF hyperparameters. RXNFP is a Transformer-based model trained on chemical reactions given as text (Schwaller et al., 2021). The BERT baseline of Zeng et al. (2022) uses BERT (Devlin et al., 2018) without pre-training. SMI-BERT is a BERT model pre-trained only on SMILES strings (Zeng et al., 2022). Graph convolutional networks (GCNs) use convolution-like neighborhood aggregations to learn graph node and edge features (Kipf and Welling, 2016) and can be applied to molecular graphs. GINs utilize a more powerful learnable aggregation function than GCNs for enhanced expressivity (Xu et al., 2018). KPGT introduces a Transformer model that utilizes line graphs generated from molecular graphs along with additional knowledge from RDKit fingerprints (Li et al., 2022). KANO integrates a chemical element knowledge graph into graph augmentations and utilizes a communicative message passing neural network architecture, which employs a directional approach to message passing between nodes (atoms) and a more dynamic approach to updating edge representations than traditional GNNs (Fang et al., 2023). MoLFormer-XL utilizes a Transformer model with rotary positional embeddings and linear attention on SMILES strings (Ross et al., 2022).

Molecule Editing Error Magnitude
We provide Figure 2 with the distribution of the measured metric across the generated molecules for each molecule editing prompt. We cannot directly compare to MoleculeSTM at the time of writing this manuscript due to issues with their fine-tuned checkpoints and the fact that their output is not publicly available. Generally, the molecules that are edited in the direction opposite to the desired one are clustered around zero change in the metric. This indicates that most of the unsuccessfully edited molecules do not move the metric value too far in the wrong direction. For the worst-performing prompts, we attribute the failures to the innate complexity of the property and the frequency of its appearance in the training text corpus. For instance, changing an arbitrary molecule into a drug-like molecule is a significantly more difficult task than increasing its solubility, where a small edit, such as swapping one functional group, may be sufficient.
Through the error analysis conducted by our recruited professionals, we provide an exploration of the worst results of our molecular edits in response to given prompts, including more hydrogen bond acceptors, increased drug-likeness, and high permeability. Despite our model's efforts to modify the structures of molecules in response to the given prompts, the results proved unsuccessful across three instances (depicted in Figures 3, 4, and 5). In the case aiming for enhanced hydrogen bond acceptor capacity (Figure 3), the model removed the sulfonamide group and the nitrogen within a thiocyanate-like structure, reducing the hydrogen bond accepting ability and hence contradicting the intended goal. The attempt to increase drug-likeness (Figure 4) led to the introduction of a triazine-like structure and a chlorinated diazene; both modifications are less favorable due to potential stability and reactivity concerns, resulting in a decreased quantitative estimate of drug-likeness (QED). Finally, the endeavor to improve molecule permeability (Figure 5) resulted in increased polarity through the introduction of multiple amine and carboxyl groups. The increase in polarity inadvertently made the molecule more soluble in water but less able to cross lipid membranes via passive diffusion, thereby diminishing the molecule's permeability. These examples demonstrate the challenges that can arise in sophisticated molecular edits, which require a nuanced understanding of the interplay between different molecular properties.

Future Direction
In terms of future directions, there is an evident need for more extensive and better-curated datasets. Larger datasets with a stronger correlation between text and molecular representations would enable the model to establish clearer distinctions among molecules. Automatically identifying such text in large corpora poses challenges, because the mention of a molecule in a text does not guarantee its relevance. Thus, an improvement in the current text sampling method is necessary. For example, future studies could leverage language models to sift through academic texts or include additional quantitative data from other sources.

Fig. 2: Plots of the distribution of the measured metrics for the generated molecules from each instance of the editing task. The (+) and (-) at the end of each plot title indicate whether a higher or lower value of the metric is desired. Green indicates the set of molecules that move the metric in the desired direction, while red represents those that do not. We additionally clarify that the desired direction of change in the metric is opposite to the phrasing of the prompt for both permeability and solubility. For instance, a lower TPSA value is desired when prompting for "higher permeability."

Exploring improvements in downstream tasks is also important. Introducing novel downstream tasks could allow us to validate the model's applicability across a broader spectrum of challenges in molecular biology and chemistry. Additionally, there exist numerous potential directions for improving the downstream tasks utilized by MolLM. For example, a major weakness of the molecule generation task lies in the fact that it sometimes generates infeasible molecules. Future works could aim to validate generated molecules for chemical validity, chemical stability, or feasibility of synthesis. This could be achieved by employing techniques such as reinforcement learning from human feedback (Ziegler et al., 2019; Stiennon et al., 2020), creating a reward function based on expert human knowledge to fine-tune the generation model. Leveraging transfer learning from models like GPT-4 could also aid in this improvement.

(Fig. 3 panel content) Prompt: "This molecule has more hydrogen bond acceptors." Discussion: The sulfonamide group, S(=O)(=O), and the nitrogen within the thiocyanate-like structure are removed, which reduces hydrogen-acceptor capability. The nitrogen now carries a positive charge [n+], which makes it less available for hydrogen bonding. The addition of Br is not helpful, as it is not typically a strong hydrogen bond acceptor due to its lower electronegativity.

For our molecule editing task, future work could explore evaluating how well MoleculeSTM, our model, and other similar models allow for dynamic adjustment of the weighting between retaining original molecular features and making chemical changes that favor the given prompt. The motivation is that a molecule often already has many desirable properties, so only minimal editing is wanted to satisfy a prompt. Moreover, a better metric could consider the ratio of the change in the molecule in favor of the prompt to how much the molecule is chemically altered. Our experiments also shed light on the lack of reliable baselines, especially in de novo generation, where there is no gold standard. This poses questions about the veracity and stability of generated molecules and presents the opportunity for the creation and adoption of better baselines.
Additionally, the editing task highlights the usefulness of incorporating natural language prompts as input to the model for powerful biomedical tasks. Exploring methods to enhance the robustness of this editing task, experimenting with various prompts, and proposing new tasks that involve natural language prompts could lead to more powerful tools leveraging the utility and ease of natural language. Furthermore, the pre-training technique could be expanded upon in a manner more advanced than simply curating larger or higher-quality datasets. There is also an avenue to explore utilizing much larger language models, such as ChatGPT (Jahan et al., 2023) or LLaMA, as agents for biomedical tasks.
Finally, exploring different molecular encoding methods could yield promising results. For example, while Transformer-M combines 2D and 3D data linearly, future works could experiment with non-linear combinations of 2D and 3D data. This approach could enhance expressiveness, enabling the model to better discern the nuances between 2D and 3D representations.

(Fig. 5 panel content) Prompt: "This molecule has high permeability."
Fig. 1: Examples of different molecule augmentations. For each molecule, its SMILES string, original molecular graph, and the molecular graph of each of the four augmentations are shown.

Fig. 3: An example of an unsuccessful molecule edit, with its prompt, original molecule, edited molecule, and a discussion of the failures in the edit.

Fig. 4: An example of an unsuccessful molecule edit, with its prompt, original molecule, edited molecule, and a discussion of the failures in the edit.
Fig. 5: An example of an unsuccessful molecule edit for high permeability, with its prompt, original molecule, edited molecule, and a discussion of the failures in the edit.